This analysis explores 11 chemical properties of white wine for 4,898 wine samples, as well as a rating of the quality of the wine on a 0-10 scale.
“Fixed Acidity” is measured as the amount of tartaric acid in the wine, in grams per cubic decimeter. The distribution of this variable in the dataset is approximately normal, with a mean of 6.855 g/dm^3 and a median of 6.8 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
“Volatile Acidity” is measured as the amount of acetic acid in the wine, in grams per cubic decimeter. The overall distribution of this variable is slightly positively skewed. The range within the 1st quartile is 0.13 g/dm^3, while the range within the 4th quartile is 0.78 g/dm^3, with 50% of the data falling between 0.21 g/dm^3 and 0.32 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
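The quartile ranges for volatile acidity can be recomputed directly from the printed five-number summary. This is a minimal sketch in R that uses only the values shown in the output above; the vector name `va` is chosen here for illustration.

```r
# Five-number summary of volatile acidity, copied from the output above (g/dm^3)
va <- c(min = 0.08, q1 = 0.21, median = 0.26, q3 = 0.32, max = 1.10)

# Width of the lowest and highest quarters of the data
first_quarter_range  <- unname(va["q1"] - va["min"])
fourth_quarter_range <- unname(va["max"] - va["q3"])

# Interquartile range: the span that holds the middle 50% of samples
iqr <- unname(va["q3"] - va["q1"])

print(round(c(first = first_quarter_range,
              fourth = fourth_quarter_range,
              iqr = iqr), 2))
```

The top quarter of the samples covers a range roughly six times wider than the bottom quarter, which is what a longer right tail looks like in a five-number summary.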
The amount of citric acid in the wine was measured in grams per cubic decimeter. The distribution of this variable in the dataset is slightly positively skewed, with a mean of 0.3342 g/dm^3 and a median of 0.3200 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
“Residual Sugar” is a measure of the amount of sugar remaining after fermentation stops, in grams per cubic decimeter. It’s rare to find wines with less than 1 g/dm^3, and wines with more than 45 g/dm^3 are considered sweet. Outliers make the distribution in this dataset strongly positively skewed: the 4th quartile spans 55.9 g/dm^3, while 75% of the data falls below 9.9 g/dm^3. After transforming the data to a log10 scale, the distribution appears bimodal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
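To see why a log10 scale helps with residual sugar, it is enough to compare how much of the axis the top quarter of samples occupies before and after the transform. The sketch below works only from the summary values printed above, so it illustrates the compression of the tail, not the bimodal shape itself. The helper `top_share` is a name introduced here for illustration.

```r
# Five-number summary of residual sugar, copied from the output above (g/dm^3)
rs <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Share of the total axis length taken up by the top 25% of samples
top_share <- function(q) unname((q["max"] - q["q3"]) / (q["max"] - q["min"]))

raw_share <- top_share(rs)          # on the original scale
log_share <- top_share(log10(rs))   # after a log10 transform

print(round(c(raw = raw_share, log10 = log_share), 2))
```

On the raw scale the top quarter of samples stretches over most of the axis, squeezing the bulk of the data into a narrow band. On the log10 scale that share drops to well under half, which leaves room for structure in the lower values to become visible.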
“Chlorides” is a measure of the amount of sodium chloride (salt) in the wine, in g/dm^3. This variable has a positively skewed distribution, with 75% of the data falling below 0.05 g/dm^3 and the remaining 25% falling between 0.05 g/dm^3 and 0.346 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
“Free Sulfur Dioxide” is measured in mg/dm^3. The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine. This variable is positively skewed: the 4th quartile alone spans 243 mg/dm^3 (from 46 mg/dm^3 to 289 mg/dm^3). The median is 34 mg/dm^3 and the mean is 35.31 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
“Total Sulfur Dioxide” is the amount of free and bound forms of SO2, in mg/dm^3. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the smell and taste of wine. Outliers skew this distribution in the positive direction, with a median of 134 mg/dm^3 and a mean of 138.4 mg/dm^3. 50% of the data falls between 108 mg/dm^3 and 167 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
“Density” is measured in grams per cubic centimeter and depends on the alcohol and sugar content of the wine. 50% of the data falls between 0.9917 g/cm^3 and 0.9961 g/cm^3. The median is 0.9937 g/cm^3 and the mean is 0.9940 g/cm^3. This distribution is slightly positively skewed because of outliers falling between 0.9961 g/cm^3 and 1.039 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
“pH” describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. This variable is approximately normally distributed in this dataset, with a mean of 3.188 and a median of 3.180.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
“Sulphates” is a measurement of the amount of potassium sulphate in g/dm^3. Potassium sulphate is a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. This variable is slightly positively skewed, with a median of 0.47 g/dm^3 and a mean of 0.4898 g/dm^3. 50% of the data falls between 0.41 g/dm^3 and 0.55 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
“Alcohol” is a measurement of the percentage of alcohol by volume in the wine. This variable is positively skewed, with 75% of the data falling between 8% and 11.4%, while the remaining 25% falls between 11.4% and 14.2%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
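The skewness descriptions above can be cross-checked with a rough heuristic: when the mean sits above the median, the right tail is usually the longer one. The sketch below collects the printed summaries into one data frame and scales the mean-median gap by the interquartile range. This is an informal indicator introduced here for illustration, not a formal skewness statistic, and it is limited by the rounding of the printed values.

```r
# Quartiles, medians and means copied from the summaries above
summ <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(6.3, 0.21, 0.27, 1.7, 0.036, 23, 108, 0.9917, 3.09, 0.41, 9.5),
  median = c(6.8, 0.26, 0.32, 5.2, 0.043, 34, 134, 0.9937, 3.18, 0.47, 10.4),
  mean   = c(6.855, 0.2782, 0.3342, 6.391, 0.04577, 35.31, 138.4, 0.9940,
             3.188, 0.4898, 10.51),
  q3     = c(7.3, 0.32, 0.39, 9.9, 0.05, 46, 167, 0.9961, 3.28, 0.55, 11.4)
)

# (mean - median) / IQR: positive values suggest a longer right tail
summ$shift <- (summ$mean - summ$median) / (summ$q3 - summ$q1)

# Largest shift first
print(summ[order(-summ$shift), c("variable", "shift")], row.names = FALSE)
```

Every variable has a mean above its median, so none of the distributions is perfectly symmetric. The gap is smallest for pH and fixed acidity, the two variables described as normal.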
This dataset includes 4,898 observations of wine with 11 numerical input variables of chemical properties (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol) and 1 output variable, a quality rating on a scale of 0-10, which I transformed into a new variable “quality.bins” to group the quality into 3 factor levels.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
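Grouping the quality rating into 3 factor levels can be sketched with `cut()`. The cut points below are an assumption made for illustration, since the breaks actually used are not stated. The level names follow the “bad”, “average” and “good” labels used later in this analysis, and the quality column is rebuilt from the per-rating counts reported in the alcohol-by-quality table.

```r
# Rebuild the quality column from the per-rating counts reported in this analysis
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# ASSUMPTION: these cut points are illustrative; the original breaks are not stated.
# Intervals are right-closed: (0,4] = bad, (4,6] = average, (6,10] = good.
quality.bins <- cut(quality,
                    breaks = c(0, 4, 6, 10),
                    labels = c("bad", "average", "good"),
                    ordered_result = TRUE)

print(table(quality.bins))
```

With these hypothetical breaks the “average” level still holds about three quarters of the samples, so any binning scheme has to live with a heavily unbalanced middle group.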
The guiding question of this dataset is “Which chemical properties influence the quality of white wines?”, so the main feature of interest is the quality rating.
I think Chlorides, Density, Alcohol, Volatile Acidity and Free Sulfur Dioxide levels might show a relationship to the quality rating of each wine. Fixed Acidity and pH might also contribute slightly to the quality ratings.
I created a new variable called “quality.buckets” to visualize the correlations found using Spearman’s method in the bivariate analysis below.
I also created the variable “sugar.alcohol.ratio” during my multivariate analysis to see how this relationship affected the density and quality ratings.
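A derived ratio like this is a one-line column operation. The sketch below assumes the ratio is residual sugar (g/dm^3) divided by alcohol (% by volume), which is my reading of “sugar to alcohol ratio”; the exact definition is not given. It uses the first ten rows shown in the `str()` output above, and the data frame name `wine_head` is hypothetical.

```r
# First ten rows of the two columns, copied from the str() output above
wine_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# ASSUMPTION: ratio defined as residual sugar divided by alcohol
wine_head$sugar.alcohol.ratio <- wine_head$residual.sugar / wine_head$alcohol

print(round(wine_head, 3))
```

Because the two columns use different units, the ratio is only meaningful for comparing wines within this dataset, not as a physical quantity.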
The residual sugar distribution was positively skewed by outliers, so transforming the x-axis to a log10 scale uncovered a bimodal distribution among the majority of the data.
## `geom_smooth()` using method = 'gam'
Since “quality” is ordinal, I used Spearman’s rank correlation to compare each variable to the quality ratings. The highest correlations were:
##
## Spearman's rank correlation rho
##
## data: quality and chlorides
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.3144885
##
## Spearman's rank correlation rho
##
## data: quality and density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.348351
##
## Spearman's rank correlation rho
##
## data: quality and alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4403692
As quality increases, alcohol content increases, while density and chlorides decrease.
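The S statistic and rho in each test output carry the same information. The printed values are consistent with the relation S = (n^3 - n)(1 - rho) / 6, which, to my understanding, is how R's `cor.test` reports S for Spearman's method when ties rule out an exact test. The sketch below inverts that relation using only the numbers printed above, so small differences from the reported rho come from the rounding of S.

```r
n <- 4898  # number of wine samples

# S statistics as printed above (rounded by R to a few significant digits)
S <- c(chlorides = 2.5743e10, density = 2.6406e10, alcohol = 1.096e10)

# Invert S = (n^3 - n) * (1 - rho) / 6 to recover rho
rho_from_S <- 1 - 6 * S / (n^3 - n)

print(round(rho_from_S, 3))
```

An S above (n^3 - n) / 6 corresponds to a negative rho, which is why chlorides and density, with the two larger S values, are the negatively correlated pair.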
The strongest relationship I found with the feature of interest (quality) was alcohol. Interestingly, when looking at the wine in my house, the only component on the labels that I could use for comparison when shopping for wine is the alcohol % by volume. Looking at the mean and median alcohol content for each quality rating, a good rule of thumb is that a good quality wine will have approximately 11% alcohol by volume or more.
## # A tibble: 7 x 4
##   quality alcohol_mean alcohol_median     n
##     <int>        <dbl>          <dbl> <int>
## 1       3     10.34500          10.45    20
## 2       4     10.15245          10.10   163
## 3       5      9.80884           9.50  1457
## 4       6     10.57537          10.50  2198
## 5       7     11.36794          11.40   880
## 6       8     11.63600          12.00   175
## 7       9     12.18000          12.50     5
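The 11% rule of thumb can be read off the table programmatically. The sketch below re-enters the per-rating means and counts from the table above, with the data frame name `by_quality` chosen for illustration. It also checks that the count-weighted average of the group means reproduces the overall alcohol mean of 10.51 reported in the univariate summary.

```r
# Per-rating summary, copied from the table above
by_quality <- data.frame(
  quality      = 3:9,
  alcohol_mean = c(10.34500, 10.15245, 9.80884, 10.57537,
                   11.36794, 11.63600, 12.18000),
  n            = c(20, 163, 1457, 2198, 880, 175, 5)
)

# Ratings whose mean alcohol content reaches the 11% rule of thumb
at_least_11 <- by_quality$quality[by_quality$alcohol_mean >= 11]

# Count-weighted mean of the group means; should match the overall mean
overall_mean <- weighted.mean(by_quality$alcohol_mean, by_quality$n)

# Share of all samples rated 7 or higher
share_high <- sum(by_quality$n[by_quality$quality >= 7]) / sum(by_quality$n)

print(at_least_11)
print(round(c(overall_mean = overall_mean, share_high = share_high), 3))
```

Only ratings 7, 8 and 9 have a mean alcohol content of 11% or more, and together they make up a little over a fifth of the samples. Mean alcohol does not rise steadily across the lower ratings: wines rated 3 and 4 average slightly more alcohol than wines rated 5.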
The top 2 strongest relationships I found among the remaining features were:
Looking at the plots, residual sugar increases slowly with density up to a density of about 0.993 g/cm^3, and increases at a faster rate above that point.
Alcohol content, on the other hand, decreases at a high rate as density rises and then tapers off around this same density threshold.
It was clear in the first plot that higher quality wines combine a low density with a high alcohol % by volume. This relationship also holds for wines with a low residual sugar content. Overall, higher quality wines have a smaller sugar to alcohol ratio at a lower density level.
The highest correlation with the quality ratings of white wine was the alcohol % by volume. Higher quality wines (rated 7 or above) have, on average, more than 11% alcohol by volume.
Density was also correlated with quality. Higher quality wines had a lower density compared to bad and average wines. When compared in combination to alcohol levels, the relationships are maintained. Quality wines have a lower density with a higher alcohol content.
Among the chemical properties, the highest correlation was between residual sugar and density. Adding in a comparison with alcohol content, it is clear that a low density wine combines a lower residual sugar content with a higher alcohol content.
The white wine dataset contains 11 chemical properties of 4,898 samples of wine, as well as a quality rating for each sample (the median of at least 3 expert evaluations). I started by looking at the distribution of each chemical component and then compared each component to the quality ratings. The residual sugar distribution had to be transformed to a log10 scale to bring it closer to normal, which interestingly uncovered a bimodal distribution. I treated the remaining distributions as approximately normal, although several are slightly positively skewed, so I used Pearson’s method to calculate correlations between chemical components and Spearman’s method to calculate correlations to the ordinal “quality” variable. I also compared chemical properties to each other and visualized how those relationships affected the quality ratings.
There was a positive correlation between alcohol % by volume and the quality of the wine, as well as a negative correlation between the density of the wine and the quality rating. The highest correlation between chemical components was density vs residual sugar.
In comparing alcohol, residual sugar, density and quality in various multivariate plots, I found that low density wines, which tended to be rated higher, combine lower residual sugar levels with higher alcohol levels.
The biggest limitation of this dataset is the disproportionate representation of each quality rating. There were no samples of wines rated 0, 1, 2 or 10. Additionally, there were far fewer examples of wines rated 3, 4, 8 and 9 (20, 163, 175 and 5 samples respectively). The majority of the data was for average rated wines, which limits the ability to see trends between bad, average, good, and excellent wines. By creating 3 levels (“bad”, “average”, and “good”) I was better able to visualize differences among the categories. Another limitation is that this dataset covers only variants of the Portuguese “Vinho Verde” wine. Further interesting analysis would include comparisons of different regions/countries, types of grapes used and the prices these wines are sold at.